Using a bike instead of a car for short trips reduces the carbon footprint of that trip by about 75% [source]. Shared public systems help increase the use of bicycles for transportation. The United States Department of Transportation Bureau of Transportation Statistics (BTS) tracks bikeshare (docked and dockless) and E-scooter systems, and presents a nice summary of these micromobility systems at this site. Here is their summary video:
Let’s focus on bikeshare systems and load in data on them from BTS:
bikes_scooter<-read_csv("http://faculty.olin.edu/dshuman/DS/Bikeshare_Docked_and_Dockless_and_E-scooter_Systems_by_Year_and_City_Served.csv")%>%
janitor::clean_names()
bikeshare<-bikes_scooter%>%
select(city,state,year,dock_ct)%>%
filter(dock_ct>0)%>%
arrange(city,state,year)
| city | state | year | dock_ct |
|---|---|---|---|
| Long Beach | CA | 2021 | 86 |
| Blacksburg | VA | 2019 | 12 |
| Columbia | MD | 2018 | 9 |
| Los Angeles | CA | 2017 | 119 |
| Harrisburg | PA | 2019 | 11 |
| Fort Lauderdale | FL | 2020 | 21 |
We also have a second table from Wikipedia that has the population and land area for the 250 most populus cities in the United States, with all data from the 2020 Census:
top_250_cities_by_pop<-read_csv("http://faculty.olin.edu/dshuman/DS/us_top_250_cities_by_population.csv")%>%
janitor::clean_names()%>%
select(city,state,pop_2020,area_sq_miles)
| city | state | pop_2020 | area_sq_miles |
|---|---|---|---|
| Lincoln | NE | 291082 | 97.7 |
| Lexington | KY | 322570 | 283.6 |
| Pasadena | TX | 151950 | 43.7 |
| Boise | ID | 235684 | 84.0 |
| Victorville | CA | 134810 | 73.7 |
| Roseville | CA | 147773 | 44.1 |
Additional reading:
A join is a data verb that combines two tables.
There are several kinds of join.
A match between a case in the left table and a case in the right table is made based on the values in pairs of corresponding variables.
As an example, we’ll examine different ways to combine the bikeshare and top_250_cities_by_pop tables.
The very first question to ask is what variables the two tables have in common. In this case, it is the city and the state.
The first class of joins are mutating joins, which add new variables (columns) to the left data table from matching observations in the right table.1
The main difference in the three mutating join options in this class is how they answer the following questions:
Three mutating join functions:
left_join(): the output has all cases from the left, regardless if there is a match in the right, but discards any cases in the right that do not have a match in the left.inner_join(): the output has only the cases from the left with a match in the right.full_join(): the output has all cases from the left and the right. This is less common than the first two join operators.When there are multiple matches in the right table for a particular case in the left table, all three of these mutating join operators produce a separate case in the new table for each of the matches from the right.
One of the most common and useful mutating joins in one that translates levels of a variable to a new scale; e.g., a join that translates letter grades (e.g., “B”) into grade points (e.g., 3) for a GPA calculuation.
Example 2.1 (left_join) Let’s mutate two new columns to the bikeshare table by pulling in the population and land area from the top_250_cities_by_pop table:
bikeshare2<-bikeshare%>%
left_join(top_250_cities_by_pop,by=c("state"="state","city"="city"))
| city | state | year | dock_ct | pop_2020 | area_sq_miles |
|---|---|---|---|---|---|
| Long Beach | CA | 2021 | 86 | 466742 | 50.7 |
| Blacksburg | VA | 2019 | 12 | NA | NA |
| Columbia | MD | 2018 | 9 | NA | NA |
| Los Angeles | CA | 2017 | 119 | 3898747 | 469.5 |
| Harrisburg | PA | 2019 | 11 | NA | NA |
| Fort Lauderdale | FL | 2020 | 21 | 182760 | 34.6 |
A few notes:
We need to match both city and state, to distinguish for example between Rochester, NY and Rochester, MN.
In this case, the variable names we are matching happen to be the same, but they don’t need to be. If the right table name was ST, we’d just change that part of the match to "state"="ST".
If the by= is omitted from a join, then R will perform a natural join, which matches the two tables by all variables they have in common. In this case, that would yield the same result (but you should be careful in general):
bikeshare2<-bikeshare%>%
left_join(top_250_cities_by_pop)
## Joining with `by = join_by(city, state)`
Example 2.2 (inner_join) The bikeshare2 table has a lot of rows corresponding to cities that aren’t in the 250 most populus. and thus have NAs in the last two columns. If we want to create a table that only has information for cities that are in both tables, we can instead use inner_join:
bikeshare3<-bikeshare%>%
inner_join(top_250_cities_by_pop,by=c("state"="state","city"="city"))
| city | state | year | dock_ct | pop_2020 | area_sq_miles |
|---|---|---|---|---|---|
| Salt Lake City | UT | 2018 | 34 | 199723 | 110.3 |
| Chicago | IL | 2019 | 608 | 2746388 | 227.7 |
| Honolulu | HI | 2018 | 136 | 350964 | 60.5 |
| San Antonio | TX | 2021 | 55 | 1434625 | 498.8 |
| New York | NY | 2021 | 1566 | 8804190 | 300.5 |
| Milwaukee | WI | 2017 | 86 | 577222 | 96.2 |
The second class of joins are filtering joins, which select specific cases from the left table based on whether they match an observation in the right table.
semi_join(): discards any cases in the left table that do not have a match in the right table. If there are multiple matches of right cases to a left case, it keeps just one copy of the left case.anti_join(): discards any cases in the left table that have a match in the right table.A particularly common employment of these joins is to use a filtered summary as a comparison to select a subset of the original cases, as follows.
Example 2.3 (semi_join to compare to a filtered summary) Compute a subset of the bikeshare table that only contains rows from the top three most populus cities in the United States.
Solution. This is where we can use semi_join to filter the rows of bikeshare:
top_three_populus<-top_250_cities_by_pop%>%
arrange(desc(pop_2020))%>%
head(n=3)
bikeshare_in_populus_cities<-bikeshare%>%
semi_join(top_three_populus)
| city | state | year | dock_ct |
|---|---|---|---|
| Chicago | IL | 2015 | 469 |
| Chicago | IL | 2016 | 582 |
| Chicago | IL | 2017 | 574 |
| Chicago | IL | 2018 | 608 |
| Chicago | IL | 2019 | 608 |
| Chicago | IL | 2020 | 612 |
| Chicago | IL | 2021 | 671 |
| Chicago | IL | 2022 | 1367 |
| Chicago | IL | 2023 | 1636 |
| Los Angeles | CA | 2016 | 104 |
| Los Angeles | CA | 2017 | 119 |
| Los Angeles | CA | 2018 | 127 |
| Los Angeles | CA | 2019 | 127 |
| Los Angeles | CA | 2020 | 213 |
| Los Angeles | CA | 2021 | 218 |
| Los Angeles | CA | 2022 | 219 |
| Los Angeles | CA | 2023 | 223 |
| New York | NY | 2015 | 507 |
| New York | NY | 2016 | 664 |
| New York | NY | 2017 | 811 |
| New York | NY | 2018 | 841 |
| New York | NY | 2019 | 841 |
| New York | NY | 2020 | 1011 |
| New York | NY | 2021 | 1566 |
| New York | NY | 2022 | 1703 |
| New York | NY | 2023 | 1954 |
Example 2.4 (anti_join) Which U.S. cities in the 250 most populus have no bikeshare program listed in the BTS data?
Solution. While semi_join keeps rows of the left table that have a match in the right table, anti_join keeps rows that do not have a match:
populus_cities_no_bike_scooter<-top_250_cities_by_pop%>%
anti_join(bikeshare)%>%
arrange(desc(pop_2020))
| city | state | pop_2020 | area_sq_miles |
|---|---|---|---|
| San Jose | CA | 1013240 | 178.3 |
| Jacksonville | FL | 949611 | 747.3 |
| Fresno | CA | 542107 | 115.2 |
| Sacramento | CA | 524943 | 98.6 |
| Mesa | AZ | 504258 | 138.7 |
| Colorado Springs | CO | 478961 | 195.4 |
Here is an additional table from Wikipedia with the top 150 United States cities by land area:
top_150_cities_by_area<-read_csv("http://faculty.olin.edu/dshuman/DS/us_top_150_cities_by_land_area.csv")%>%
janitor::clean_names()%>%
arrange(desc(land_area_mi2))
| city | st | land_area_mi2 | land_area_km2 | water_area_mi2 | water_area_km2 | total_area_mi2 | total_area_km2 | population_2020 |
|---|---|---|---|---|---|---|---|---|
| Sitka | AK | 2870.1 | 7434 | 1945.1 | 5038.0 | 4815.1 | 12471 | 8458 |
| Juneau | AK | 2704.2 | 7004 | 550.7 | 1426.0 | 3254.9 | 8430 | 32255 |
| Wrangell | AK | 2556.1 | 6620 | 920.6 | 2384.0 | 3476.7 | 9005 | 2127 |
| Anchorage | AK | 1707.0 | 4421 | 239.7 | 621.0 | 1946.7 | 5042 | 291247 |
| Tribune | KS | 778.2 | 2016 | 0.0 | 0.0 | 778.2 | 2016 | 1182 |
| Jacksonville | FL | 747.3 | 1935 | 127.2 | 329.0 | 874.5 | 2265 | 949611 |
| Anaconda | MT | 736.7 | 1908 | 4.7 | 12.0 | 741.4 | 1920 | 9421 |
| Butte | MT | 715.8 | 1854 | 0.6 | 1.6 | 716.3 | 1855 | 34494 |
| Houston | TX | 640.6 | 1659 | 31.2 | 81.0 | 671.8 | 1740 | 2304580 |
| Oklahoma City | OK | 606.5 | 1571 | 14.3 | 37.0 | 620.8 | 1608 | 681054 |
Exercise 2.1 Use your wrangling skills to answer the following questions. Hint: start by thinking about what tables you might need to join (if any), determining which is the left table and which is the right table, and identifying the corresponding variables to match.
Which of the 250 most populus cities in the United States had the most shared bicycle docking stations per square mile in 2022?
Which of the 20 most populus cities in the United States has the most land area per 100000 inhabitants?
Find a list of all of the 20 largest U.S. cities by land area that had a bikeshare program listed in the BTS data, any time between 2015 and 2023.
How many U.S. cities are in the top 150 by population, but not in the top 150 by land area? Hint: It is an even number!
# a)
docks_per_sq <-
bikeshare3 %>%
mutate(docks_per_mile = dock_ct/area_sq_miles) %>%
arrange(desc(docks_per_mile)) %>% # arranging in descending order of sq mile
filter(year == "2022")
# b)
answer_b <-
top_250_cities_by_pop %>%
mutate(land_per_pop=(100000*area_sq_miles)/pop_2020) %>% # land per population 100000
arrange(desc(land_per_pop)) %>%
head(n=20)
# c) COME BACK
bikeshare_15_23 <- # filtering all the cities thats not the right year
bikeshare %>%
filter(year %in% c('2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023'))
largest_bikeshare_15_23 <-
top_250_cities_by_pop %>% # has no duplicates
semi_join(bikeshare_15_23, join_by(city, state))%>% # semi joining the
arrange(desc(area_sq_miles)) %>% # top cities by land area
head(n=20)
# d)
top_150_pop <-
top_250_cities_by_pop %>%
arrange(desc(pop_2020)) %>% # arranging by top population
head(n=150) # taking only top 150
top_150_pop_and_area <-
top_150_pop %>%
anti_join(top_150_cities_by_area) %>%
slice(-1) # will remove firstmost row, NY
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
Figure 3.1: A typical Capital Bikeshare station. This one is at Florida and California, next to Pleasant Pops.
Figure 3.2: One of the vans used to redistribute bicycles to different stations.
Two data tables are available:
Trips contains records of individual rentalsStations gives the locations of the bike rental stationsHere is the code to read in the data:2
# Trips <- read_csv("http://faculty.olin.edu/dshuman/DS/2014-Q4-Trips-History-Data-Small.csv")
Trips <- read_csv("http://faculty.olin.edu/dshuman/DS/2014-Q4-Trips-History-Data.csv")
Stations<-read_csv("http://faculty.olin.edu/dshuman/DS/DC-Stations.csv")
The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you can access the full data set of more than 600,000 events by removing -Small from the file name.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started.
Exercise 3.1 (Single variable temporal plots) Make the following plots and interpret them:
sdate. Use ggplot() and geom_density().mutate with lubridate::hour(), and lubridate::minute() to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a field where 3:30 is 3.5 and 3:45 is 3.75.# a)
ggplot(Trips, aes(x=sdate))+
geom_density(fill="red")+
labs(x="Starting Date of Trip", y="Density", title="Density Plot of Starting Date of Bike Trips")
# b)
Trips <-
Trips %>%
mutate(hour=lubridate::hour(sdate),minute=((lubridate::minute(sdate)*10)/6)*0.01, time_of_day = hour+minute)
ggplot(Trips, aes(x=time_of_day))+
geom_density(fill='blue')+
labs(x="Time of Day of Trip Start", y="Density", title="Density Plot of Starting Time of Day of Bike Trips")
# c)
Trips <-
Trips %>%
mutate(day_of_week = wday(sdate, label=TRUE))
ggplot(Trips, aes(x=day_of_week))+
geom_bar()+ # we use bar insead of histogram bc x is discrete values
labs(x="Day of the Week", y="Frequency of Trips", title="Frequency of Trips Taken by Day of the Week")
# d)
ggplot(Trips, aes(x=time_of_day))+
geom_density(fill='blue')+
facet_grid(rows=vars(day_of_week), scale="free")+
labs(x="Time of Day that Trip Started", y="Density", title="Density Plot of Starting Time of Day of Bike Trips")
The pattern we can see from this visual is that on the weekdays, there are spikes of significantly more bike usage during the hours of 9am and 5pm when people will commonly get out of work. On the weekends, in contrast, the usage of bikes slowly surge starting 9am and will slowly go down at around 5pm, but there is no decrease in bike usage in between like during the weekday. Rather, the bike usage actually peaks around noontime/late afternoon.
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Causal). Do you think these two different categories of users show different rental behavior? How might it interact with the patterns you found in Exercise 3.1?
Exercise 3.2 (Customer segmentation) Repeat the graphic from Exercise 3.1 (d) with the following changes:
fill aesthetic for geom_density() to the client variable. You may also want to set the alpha for transparency and color=NA to suppress the outline of the density function.position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?mutate(wday = ifelse(lubridate::wday(sdate) %in% c(1,7), "weekend", "weekday")). What does the variable wday represent? Try to understand the code.wday and fill with client, or vice versa?# a)
ggplot(Trips, aes(x=time_of_day, fill=client))+
geom_density(alpha=0.35, color=NA)+
facet_grid(rows=vars(day_of_week), scale="free")+
labs(x="Time of Day that Trip Started", y="Density", title="Density Plot of Starting Time of Day of Bike Trips")
# b)
ggplot(Trips, aes(x=time_of_day, fill=client))+
geom_density(alpha=0.35, color=NA, position=position_stack())+
facet_grid(rows=vars(day_of_week), scale="free")+
labs(x="Time of Day that Trip Started", y="Density", title="Density Plot of Starting Time of Day of Bike Trips")
I personally feel that adding the stack makes it more difficult to differentiate between the separate behaviors of each crowd. Stacking, however, has the advantage of showing how each type of client contributes to the total/overall frequency of bike trips (as far as I know).
# c)
Trips <-
Trips %>%
mutate(wday=ifelse(lubridate::wday(sdate) %in% c(1,7), "weekend", "weekday"))
# example of faceting by client, and filling by day of week
ggplot(Trips, aes(x=time_of_day, fill=day_of_week))+
geom_density(alpha=0.35, color=NA, position=position_stack())+
facet_grid(rows=vars(client), scale="free")+
labs(x="Time of Day that Trip Started", y="Density", title="Density Plot of Starting Time of Day of Bike Trips")
e) I really liked the plot made in ex 3.2a (without position stack) because you can clearly see the differences in bike usage between the two types of clients. You can see that registered clients have the pattern of using the bikes on a routine during the weekdays, while causal clients on the weekdays use it at their leisure, regardless of popular clock in/out hours.
Exercise 3.3 (Visualization of bicycle departures by station) Use the latitude and longitude variables in Stations to make a spatial visualization of the total number of departures from each station in the Trips data. Your map can be static or interactive via leaflet. Here is a good bounding box:
dc_bbox<-c(-77.1,38.87,-76.975,38.95)
leaflet(data=Stations) %>%
addTiles() %>%
addMarkers(lng=~long,
lat=~lat,
label=~name)
Exercise 3.4 Only 14.4% of the trips in our data are carried out by casual users.3 Create a map that shows which area(s) of the city have stations with a much higher percentage of departures by casual users. Interpret your map. Hint: you may want to exclude stations with low overall traffic (e.g., under 20 total departures).
group by station and casual summarize ( counting per staton num o casual users )
percent_departures <-
Trips %>%
group_by(sstation, client) %>% # to grab total visits of cas/reg clients
summarize(total_visitors=n()) %>% # counting up visitors of that client type
pivot_wider(names_from=client, values_from=total_visitors) %>% # making clients into columns to condense tables
mutate(percent_cas = (Casual*100)/(Casual+Registered)) %>% # percentages calculated
arrange(desc(percent_cas)) %>%
inner_join(Stations, by=c("sstation"="name")) %>% #adding long/lat columns
select(-nbBikes, -nbEmptyDocks) %>%
filter(percent_cas > 30)
leaflet(percent_departures)%>%
addTiles() %>%
addMarkers(lng=~long,
lat=~lat,
label=~percent_cas)
Exercise 3.5 (High traffic points)
as_date(sdate) converts sdate from date-time format to date format.wday (weekend/weekday), and count the total number of trips in each of the four groups. Interpret your results.# a)
station_date_highest_dep <-
Trips %>%
# filter(client=="Casual") %>%
mutate(date=as_date(sdate)) %>% # a column w just the date to filter
group_by(sstation, date) %>% # grouping by station and date to then summarize #
summarize(departures=n()) %>%
# mutate(week_day=wday(date, label=TRUE)) %>%
arrange(desc(departures)) %>%
head(10)
# b) semi join
station_join <-
Trips %>%
mutate(date=as_date(sdate)) %>% # adding date column
semi_join(station_date_highest_dep, by=c("sstation"="sstation", "date"="date")) %>% # discarding cases from Trips that dont align w the station/date combo
select(-duration, -sdate, -edate, -estation, -bikeno, -hour, -minute, -time_of_day, -day_of_week) %>%
head(10)
# c)
station_join <-
station_join %>%
group_by(client, wday) %>%
summarize(trips_total = n()) %>% # filter(week_day %in% c("Mon")) %>%
pivot_wider(names_from=c(client, wday), values_from=trips_total)
station_join
## # A tibble: 1 × 1
## Registered_weekday
## <int>
## 1 10
There is also a right_join() that adds variables in the reverse direction from the left table to the right table, but we do not really need it as we can always switch the roles of the two tables.↩︎
Important: To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.↩︎
We can compute this statistic via mean(Trips$client=="Casual").↩︎